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Abstract 

The popular Lasso approach for sparse estimation can be derived via marginalization of a joint 
density associated with a particular stochastic model. A different marginalization of the same 
probabilistic model leads to a different non-convex estimator where hyperparameters are optimized. 
Extending these arguments to problems where groups of variables have to be estimated, we study 
a computational scheme for sparse estimation that differs from the Group Lasso. Although the 
underlying optimization problem defining this estimator is non-convex, an initialization strategy 
based on a univariate Bayesian forward selection scheme is presented. This also allows us to define 
an effective non-convex estimator where only one scalar variable is involved in the optimization 
process. Theoretical arguments, independent of the correctness of the priors entering the sparse 
model, are included to clarify the advantages of this non-convex technique in comparison with 
other convex estimators. Numerical experiments are also used to compare the performance of these 
approaches. 

Keywords: Lasso; Group Lasso; Multiple Kernel Learning; Bayesian regularization; marginal 
likelihood 



1. Introduction 

We consider sparse estimation in a linear regression model where the explanatory factors 6 G W" 

are naturally grouped so that is partitioned as = 0^^) ... O^^^ In this setting 

we assume that 6 is group (or block) sparse in the sense that many of the constituent vectors 6^'^ are 
zero or have a negligible influence on the output j G M". In addition, we assume that the number of 
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unknowns m is large, possibly larger than the size of the available data n. Interest in general spar- 
sity estimation and optimization has attracted the interest of many researchers in statistics, machine 
learning, and signal processing with numerous applications in feature selection, compressed sens- 
ing, and selective shrinkage ( Hastie and Tibshirani[ 1990 TibshiranT}|1996[ Donoho[ 2006; Candes 



and Tao[ |2007| l. The motivation for our study of the group sparsity problem comes from the "dy- 
namic Bayesian network" scenario identification problem as discussed in ( [Chiuso and Pillonetto| 
2010b|a ]). In a dynamic network scenario the "explanatory variables" are often the past histo- 
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ries of different input signals with the "groups" 0*^'' representing the impulse response^describing 
the relationship between the i-th input and the output y. This application informs our view of the 
group sparsity problem as well as our measures of success for a particular estimation procedure. 

Several approaches have been put forward in the literature for joint estimation and variable se- 
lection problems. We cite the well known Lasso (Tibshirani 1996 1, Least Angle Regression (LAR) 
(Efron et^m 2004 1, their "group" versions Group Lasso (GLasso) and Group Least Angle Regres- 



sion (GLAR) (Yuan and Lin 2006 1, Multiple Kernel Learning (MKL) (Bachetal. 2004; Evgeniou 



et al. 2005 [ Pillonetto et al. 2010). Methods based on hierarchical Bayesian models have also been 



considered such as Automatic Relevance Determination (ARD) (i Mackay[ 1994 1, the Relevance 
Vector Machine (RVM) ( Tipping[ 2001 1, and the exponential hyperprior in ( [Chius o and Pillonetto 



2010b[ 201 1[ |. The Bayesian approach considered in ( [Chiuso and Pillonetto[ 2010b, 201 1|) and fur- 



ther developed in this paper is intimately related to (fMackay 1994; Tipping[ [2001 ); in fact, the 
exponential hyperprior algorithm in ( [Chiuso and Pillonetto, |2010b[ [201 1[ ) is a penalized version of 
ARD. A variational approach based on the golden standard spike and slab prior, also called two- 
groups prior ( Efron[ 2008 1, has been also recently proposed in ( Titsias and Lzaro-Gredilla[ 201 1 1. 

An interesting series of papers ( [Wipf and Rao[ [2007[ [Wipf and Nagarajan] [2007[ [Wipf et aL 



201 1[ ) provide a nice link between penalized regression problems like Lasso, also called type-I meth- 



ods, and Bayesian methods (hke RVM ( ,Tipping[[2001[ ) and ARD (|Mackay|[T994|)) with hierarchical 
hyperpriors where the "hyperparameters" are estimated via maximizing the marginal likelihood and 



then inserted in the Bayesian model following the Empirical Bayes paradigm (Maritz and Lwin 



1989 1; these latter methods aie also known as type-II methods ('Berger[ 1985 1. Note that this Empir- 



ical Bayes paradigm has also been recently used in the context of System Identification (Pillonetto 
and De Nicolaol[20T0l[Pirionetto et al.[[20T]][Chen eTaLl[20TT] ). 



In (Wipf and Nagarajan 2007[ Wipf et al. 2011 1 it is argued that type-II methods have ad- 
vantages over type-I methods; some of these advantages are related to the fact that, under suitable 
assumptions, the former can be written in the form of type-I with the addition of a non-separable 
penalty term (a functiong(xi, is non-separable if it cannot be written as g(xi , ..,x„) = Y!i=i h{xi)). 
The analysis in ( Wipf et aL|[2011[ ) also suggests that in the low noise regime the type-II approach re- 
sults in a "tighter" approximation to the norm. This is supported by experimental evidence show- 
ing that these Bayesian approaches perform well in practice. Our experience is that the approach 
based on the marginal likelihood is particularly robust w.r.t. noise regardless of the "correctness" of 
the Bayesian prior. 

Motivated by the nice performance of the exponential hyperprior approach introduced in the 
dynamic network identification scenario ( Chiuso and Pillonetto[ [2010b[ 2011[ ), we provide some 
new insights clarifying the above issues. The main contributions are as follows: 



1 . An thus may, in principle, be infinite dimensional. 
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(i) in the first part of tlie paper we discuss tiie relation among Lasso (and GLasso), the Expo- 
nential Hyperprior (HGLasso algorithm hereafter, for reasons which will become clear later 
on) and MKL by putting all these methods in a common Bayesian framework (similar to 
that discussed in ( [Park and Casella 2008 1). Lasso/GLasso and MKL boil down to convex 
optimization problems, leading to identical estimators, while HGLasso does not. 

(ii) All these methods are then compared in terms of optimality (KKT) conditions and trade- 
offs between sparsity and shrinkage are studied illustrating the advantages of HGLasso over 
GLasso (or, equivalently, MKL). Also the properties of Empirical Bayes estimators which 
form the basis of our computational scheme are studied in terms of their Mean Square Er- 
ror properties; this is first established in the simplest case of orthogonal regressors and then 
extended to more general cases allowing for the regressors to be realizations from, possibly 
correlated, stochastic processes. This, of course, is of paramount importance for the system 



identification scenario studied in (Chiuso and Pillonetto 2010b 201 1 1. 



Our analysis avoids assumptions on the correctness of the priors entering the stochastic model 
and clarifies why HGLasso is likely to provide more sparse and accurate estimates in com- 
parison with the other two convex estimators. As a byproduct, our study also clarifies the 
asymptotic properties of ARD. 

(iii) Since HGLasso requires solving non-convex, and possibly high-dimensional, optimization 
problems we introduce a version of our computational scheme which can be used as an initial- 
ization for the full non-convex search requiring optimization with respect to only one scalar 
variable representing a common scale factor for the hyperparameters. Such Bayesian schemes 
with a hyperprior having a common scale factor, or more generally group problems in which 
each group is described by one hyperparameter, can be seen as instances of "Stein estimators" 
( |James and Stein, ,196T| |Efron and Morris, ,1973 Stein 1981) and have close connections to 
the non-negative garrote estimator (Breiman 1995| l. The initialization we propose is based 
on a selection scheme which departs from classical Bayesian variable selection algorithms 



([George an d McCulloch[[T993t[GeOTge and Foster} [2000t [Scott and Berger][20T0| ). These lat- 
ter methods are based on the introduction of binary (Bernoulli) latent variables. Instead our 
strategy involves a "forward selection" type of procedure which may be seen as an instance of 



the "screening" type of approach for variable selection discussed in (Wang 2009 1; note how- 
ever that while classical forward selection procedures work in "parameter space" our forward 
selection is performed in hyperparameter space through the marginal posterior (i.e. once the 
parameters 6 are integrated out); in the asymptotic regime this procedure is equivalent to per- 
forming forward selection using BIG as a criterion. This "finite data" Bayesian flavor seems 
to be a key feature which makes the procedure remarkably robust as the experimental results 
confirm. Note that backward and forward-backward versions of this procedure have also been 
tested with no notable differences. 

(iv) Extensive numerical experiments involving artificial and real data are included which confirm 
the superiority of HGLasso. 



The paper is organized as follows. In Section|2]we introduce the Lasso approach in a Bayesian 
framework, as well as another estimator, namely HLasso, that requires the optimization of hyper- 
parameters. Section [3] extends the arguments to a group version of the sparse estimation problem 
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Figure 1: Bayesian networks describing the stochastic model for sparse estimation (a) and group 
spai^se estimation (b) 



introducing GLasso and the group version of HLasso which we call HGLasso. In Section[4]the rela- 
tionship between HGLasso and MKL is discussed, reviewing the equivalence between GLasso and 
MKL. Section [5] clarifies the advantages of HGLasso over GLasso and MKL on a simple example. 
In Section |6] the Mean Squared Error properties of the Empirical Bayes estimators are studied, in- 
cluding their asymptotic behavior. In Section|7]we discuss the implementation of our computational 
scheme, also deriving a version of HGLasso that requires the optimization of an objective only with 
respect to one scalar variable. Section [8]reports numerical experiments involving artificial and real 



data, also comparing the new approach with the adaptive Lasso described in (Zou 20061. Some 
conclusions end the paper. 

2. Lasso and HLasso 

Let 6 = [di 02 • • • dmV be an unknown parameter vector while j S M" denotes the vector contain- 
ing some noisy data. In particular, our measurements model is 

y = Gd+v (1) 

where G £ R"^™ and v is the vector whose components are white noise of known variance a^. 

2.1 The Lasso 

Under the assumption that 6 is sparse, i.e. many of its components are equal to zero or have 
a negligible influence on y, a popular approach to reconstruct the parameter vector is the Lasso 
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( Tibshirani] 1996 1. The Lasso estimate of 6 is given by 

(y-GdV(y-Gd) ^, , 

Gl = argmin ^ ^ '- + 7^ £ (2) 

e za 

where Jl G ^+ is the regularization parameter. A key feature of the Lasso is that the estimate Gl is 
the solution to a convex optimization problem. 

As in ( Park and Casella[ 2008| l, we describe a derivation of the Lasso through the marginaliza- 
tion of a suitable probability density function. This hierarchical representation is useful for estab- 
lishing a connection with the variety of estimators considered in this paper. The Bayesian model we 
consider is depicted in Fig. [TJa). Nodes and arrows are either dotted or solid depending on being 
representative of, respectively, deterministic or stochastic quantities/relationships. Here, A denotes 
a vector whose components are independent and identically distributed exponential random 

variables with probability density 

Py{Xd = Ye-'^-X{^d, (3) 
where 7 is a positive scalar while = 1 if ? > 0, otherwise. In addition 

0,|A,-~^(O,A,-) and v^^{0,a^ln), (4) 

where is the Gaussian density of mean jj. and covariance Z while /„ is the n xn identity 

matrix. We have the following result from Section 2 in (Park and Casella 2008 1. 

Theorem 1 Given the Bayesian network in Fig. [TJa ), let 

= argmax / p(G,X\y)dX, (5) 

^ ^ 

Then G = Gi provided that Jl = \/2y. 



2.2 TheHLasso 

Theorem[T]inspires the definition of an estimator obtained by marginalizing with respect to G instead 
of X and then maximizing the resulting marginal density p{X\y) with respect to X obtaining an 
estimate X for X. Having X, we use an empirical Bayes approach and set Ghl '■= IE[0b,A] (the 
minimum variance estimate of G given y and X = X). We call Ghl the Hyperparameter Lasso 
(HLasso). This estimator is given in the next theorem which uses the fact that G conditional on X 
is Gaussian, so that the marginal density of X is available in closed form. The proof flows from the 
following observations: 

p{G,X\y) oc \A\-^^^exp[ 

•exp[-^(0 

where 

A = diag(A), I,(A):=(a2/ + GAGT), (7) 

and 

Ghl{X) ■.= E[G\y,X] = {a^A-^ +G'^Gy^G'^y = AG'^Ly{Xy^y (8) 



(6) 



• GHL{X)y{A-'+a-^G'^G){G - GHL{X))]exp[-Yl'^ X], 
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Notice that the equivalence of the two expressions for the Qhl{^) follows from the matrix inversion 
formula 



(9) 



Also note that (jsjl assures us that Qhl{^) well defined even when some of the components of A are 
zero. The matrix ^y{X) plays a fundamental role in much of the analysis of this paper. The Law of 
Iterated Expectation tells us that ry(A) is simply the second moment of y given A, indeed, 

K\yy'^\X] = E[E[);3;T|0]|A] 

= E[Vartv|0]+E[3;|0]Etv|e]T|A] 

= a2/ + GE[00T|A]G^ 

= M^)- (10) 



Theorem 2 Given the Bayesian network in Fig. [TJa ), let 



A=argmax / p(6,X\y)dG. 



(11) 



Then 



11 

I = arg mm -logdet(£,(A)) + -/(r3,(A))-V + 7£A;, 



'Xe«l2 2 
and, given X =X, the HLasso estimate of d is given by 

%l:=E[0|3;,A] 



(12) 



(13) 



The objective in ( 1 1 1 depends on m variables as in the Lasso case, however the optimization 
problem is no-longer convex since the function logdet(ry(A)) is a concave function of A as it is the 
composition of the concave function logdet(r) and the affine function 'Ly{X). 

It is worth observing that the estimator obtained from ( [TT] ) and (13 1 is a form of "Sparse Bayesian 
Learning" having close resemblance with ARD ( Mackay) 1994) and RVM (Tippi ng} [2001] ), see also 
( [Wipf and Nagarajan 2007 1. In fact ARD is obtained by setting 7 in ^V2\ to zero while the Gamma 
prior used in RVM (see eq. (6) in ( |Tipping[ |2001[ )) seems to play a symmetric role, favoring large 
values of A,'s. Note that the Gamma prior in equation (6) of (Tipping 2001 1 becomes flat as a — )■ 
and — )■ 0, similarly to (|3]l as 7 — )• 0. A similar discussion applies also to the Group version of this 
estimator to be introduced in Section 1X2] 

In Sections [5] and [6] we show that the parameter 7plays a fundamental role in enforcing sparsity. 
In addition, we establish an interesting interpretation in terms of the Mean Squared Error properties 
of the resulting estimators as 7 — 0. Note also that 7 plays a fundamental role in model selection 



consistency (see Remark 1 1 1 
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3. GLasso and HGLasso 

We now consider a situation where the explanatory factors G used to predict y are grouped. Think 
of 6 as being partitioned into p sub-vectors d^'\i = 1, so that 

For / = !,...,/?, assume that the sub-vector 6^'^ has dimension kj so that m = Y.f^i kj. Next, confor- 
mally partition the matrix G = [G^^\. .. , G'^^] to obtain the measurement model 



p 

•y = G0+v = £G(')0(')+v. (15) 

;=1 



In what follows, we assume that 6 is block sparse in the sense that many of the blocks 6^'^ are null, 
i.e. with all of their components equal to zero, or have a negligible effect on y. 

3.1 The GLasso 



A leading approach for the block sparsity problem is the Group Lasso (GLasso) (Yuan and Lin 



2006 1. The Group Lasso determines the estimate of 6 as 



(y-GdViy-Gd) p„ ^■^„ 

©GL = arg min ^ ^ + (16) 



where || • || denotes the classical Euclidean norm. Notice that the representation ( 16 1 assumes that 
the 0*^'' are i.i.d. with 



p{Q^'^ |7gl) °<^exp 



-Ygl 



It is easy to see that, as in the Lasso case, the objective is convex. 



3.2 The HGLasso 



An alternative approach to the block sparsity problem is discussed in ( Chiuso and Pillonetto 2010b I. 
This approach relies on the group version of the model in Fig. 1(a) illustrated in Fig. 1(b). In the 
network, A is now a p-dimensional vector with / — th component given by A,- G R+. In addition, 
conditional on A, each block 0^'^ of the vector d is zero-mean Gaussian with covariance 
/ = l,..,p, i.e. 

0«|A,~A^(O,AA,). (17) 

As for the HLasso, the proposed estimator first optimizes the marginal density of A, and then again 
using an empirical Bayes approach, the minimum variance estimate of d is computed with A taken 
as known and set to its estimate. We call this scheme Hyperparameter Group Lasso (HGLasso). It 
is described in the following theorem. 



Theorem 3 Consider the Bayesian network in Fig. 1 (b) with measurement model given by (15\, 
(17 ), and and define 



A=argmax/ p{d,?i\y)dd. (18) 
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Then, X is given by 



where 



arg min llogdet(r,.(A)) + VlT^ (A)3; + A,-, 



(19) 



Ij.(A) :=GAG^ + a2/, A := blockdiag{{XiIk,]) . (20) 
In addition, the HGLasso estimate of 6, denoted OhgLj i^ given by setting A = A in the function 

OhglW :=E[d\y,X]=AG'^{Ly{X))-'y. (21) 



The derivation of this estimate is virtually identical to the derivation of the estimate given in 
Theorem |2] For this reason, we slightly abuse our notation by not introducing a new notation for 
the key affine matrix mapping I^yiX). Just as in the HLasso case, the objective in ( 19 1 is not convex 



in A. However, now the optimization is performed in the lower dimensional space R'', rather than 
in M'" where the GLasso objective is optimized. 

Let the vector jj. denote the dual vector for the constraint A > 0. Then the Lagrangian for the 



problem ( 19 1 is given by 

L(A,At) :=\logdet{-Ly{X)) + \y^'Ly{X)-^y + Yl^X-^^X. 

Using the fact that 

d,L{X,^) = 2tr(G(')Tr,(A)-iG(')) 

- ^3'^I,(A)-iG«G«Tz,(A)-i3' + 7 



(22) 



we obtain the following KKT conditions for ( 19 1. 



Proposition 4 The necessary conditions for X to be a solution of {19) are 
E, = a2/ + i:f^iA,-GWGWT 



W^y=l 



tr (gW^WGW) - \\G^^'^Wy\\l + 2y- 2Hi = 0, i=l,...,p 
HiXi = 0, i=l,...,p 
0<n, X andO^W,'Ly. 



(23) 



It is interesting to observe that, by ( [T0| ), one has 

E 



eHGdX)eHGL{Xy I X = AGTly(A)-^E[3;/ I X]I.y{X)-^GA = AG'^I.y{X)-^GA, 



and so 



E 









X 



A2fG«TwG«V /=l,...,p. 



(24) 



In addition, 



\4U^)\\' = ^i\\G^'^^Wy\\l i=l,...,p. 
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Equation p3| ) indicates that when tuning A there should be a link between the "norm" of the actual 
estimator to its a priori second moments (24 1. In particular, when no regularization is 

imposed on A (i.e. 7 = 0) and the nonnegativity constraint is not active, i.e. /x, = 0, one finds that 
the optimal value of A; makes the norm of the estimator equal to (the trace of) its a priori matrix of 
second moments. 



3.3 GLasso does not derive from marginalization of the posterior 

Differently from the Lasso case, when the block size is larger than 1 , GLasso does not derive from 
marginalization of the Bayesian model depicted in Fig. [TJb). To see this, consider the problem of 
integrating out A from the joint density of 6 and A described by the model in Fig. [T|b). The result is 
the product of multivariate Laplace densities. In particular, if B^'^ (•) is the modified Bessel function 
of the second kind and order ki/2 — 1, then, following ( Eltoft et aE] [2006 1, we obtain 



\2-ki/4 



(e(')Te(0)W4-2 



(25) 



whereas the prior density underlying the GLasso must satisfy 

p 

I 

(=1 



M0)-exp(-7GLll|0«||). 



(26) 



One can show that, for kj > 1 with 0^') tending to zero the prior density on used in the GLasso 
remains bounded, while the marginal of the density used for HGLasso in p5|) tends to oo. 



4. Relationship with Multiple Kernel Learning 



Multiple Kernel Learning (MKL) can be used for the block sparsity problem (Bach et al. 2004 



Evgeniou et aH |2005t |Dinuzzo[ |2010t |Bach| |2008| l. To introduce this approach consider the mea- 
surements model 

y = f+v = tf 



f(0 



+ v 



(27) 



where v is as specified in (|4]l. In the MKL framework, / represents the sampled version of a scalar 
function assumed to belong to a (generally infinite-dimensional) reproducing kernel Hilbert space 
(RKHS) Wahba, (1990). For our purposes, we consider a simplified scenario where the domain of 
the functions in the RKHS is the finite set [1, . . . ,«]. In this way, / represents the entire function and 
y is the noisy version of / sampled over its whole domain. In addition, we assume that / belongs to 
the RKHS, denoted J^k, having kernel defined by the matrix 



(28) 



i=i 



where it is further assumed that each of the functions f^'^ is an element of a RKHS, denoted Jf^'\ 
having kernel A;/r('' with associated norm denoted by ||/(') || (,). 
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According to the MKL approach, the estimates of the unknown functions /^'^ are obtained 
jointly with those of the scale factors A, by solving the following inequality constrained problem: 

({/«},A)= argmin ^^^^^^^^^ + £ ||/« || J) 



{/(')}, A el 



;=1 



s.t. £a,<m, 



(29) 



i=l 



where M plays the role of a regularization parameter. Hence, the "scale factors" contained in A € R+ 
are optimization variables, thought of as "tuning knobs" adjusting the kernel K{X) to better suit the 
measured data. Using the extended version of the representer theorem, e.g. see ( [Dinuzzo 2010[ 



Evgeniou et aH |2005| l, the solution is 

/«=A,A^ i = i,...,p, 



(30) 



where 



{c,X} = argmin 

ceR'',AeR,t 

S.t. £ Xi < M. 



{y-K{X)cY{y-K{X)c) 



'C^K{X)c 



p 



(31) 



It can be shown that every local solution of the above optimization problem is also a global solution. 



see (Dinuzzo 2010) for details 



For our purposes, it is useful to define as the Gaussian vector with independent components 
of unit variance such that 

ei = ^/Xi^i. (32) 



We partition conformally with 6, i.e 





{P) 



(33) 



Then, the following connection with the Bayesian model in Fig. [TJb) holds. 



Theorem 5 Consider the joint density of<\) and X conditional on y induced by the Bayesian network 
in Fig. ^b). Set K^^^ = G'-'^G^'^^, / = 1, . . . ,/?. Then, there exists a value of y (function ofM) such 
that the maximum a posteriori estimate of X for this value ofy( obtained optimizing the joint density 
of and X ) is the X from ( 31 1. In addition, for this value of y one has 



p 

I 



and the c in {31 ) is given by 



, . y'^{K{X) + a^I)-'y , ^ 

A = arg nun + 7> A; 

XeK 2 ^ 



c(A) = (^(A) + a^/)-'3'. 



(34) 



(35) 
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Again, for this value of J, the maximum a posteriori estimates of the blocks of (j) are 
Finally, one has 



d. 



GL 



(36) 
(37) 



where doL i^ GLasso estimate {16) for a suitable value ofjoL- 



We supply the KKT conditions ( 34 ) in the following proposition. 



Proposition 6 The necessary and sufficient conditions for X to be a solution of {34) are 



W'Ly=I 



- IIgWTWjII^ + 2r-2Hi = 0, i=l,...,p 
lXiXi = 0, i=\,...,p 
0<H, X andO^W,Ly. 



(38) 



4.1 Concluding remarks of the section 



Eq. 37 in Theorem |5] states the equivalence between MKL and GLasso. It is a particular instance 
of the relationship between regularization on kernel weights and block-norm based regularization, 



see Theorem 1 in [Tomioka and Suzuki (2011). In the next sections, such connections will help 
in understanding the differences between GLasso and HGLasso by comparing the KKT conditions 
derived in Propositions [4] and [6] 

Notice also that the GLasso estimate provides the maximum a posteriori (MAP) estimate of but 
not that of 0. In fact, \/~Xi^^'^ is not the MAP estimate of d^'\ In this regard, it is not difficult to see 
that, according to the model in Fig. [TJb), the joint density of 6 and X given y is not bounded above 
in a neighborhood of the origin. Hence, the MAP estimator of 6 would always return an estimate 
equal to zero. One can however conclude from Theorem [5] that MKL (GLasso) arises from the 
same Bayesian model as the HGLasso considering and X as unknown variables. The difference 
is that the MKL estimate of X is obtained by maximizing a joint rather than a marginal density. 



It is worth comparing the expression for the MKL estimator in ( 34 1 with the expression for the 
HGLasso estimator given in ( 19 1. Under the assumptions stated in Theoremjsj I^yiX) = K{X) + G^I. 



Hence, the objectives in ( 34 1 and ( 19 1 differ only in the term ^ log det(£v) appearing in the HGLasso 



objective (|T9|). Notice also that this is the component that makes problem ( 19) non-convex. On the 



other hand, it is also the term that forces the HGLasso to favor sparser solutions than the MKL since 
it makes the marginal density of X more concentrated around zero. 



5. Sparsity vs. Shrinkage: A simple experiment 

It is well known that the £i penalty in Lasso tends to induce an excessive shrinkage of "large" co- 
efficient in order to obtain sparsity. Several variations have been proposed in the literature in order 
to overcome this problem, including the so called Smoothly-Clipped-Absolute-Deviation (SCAD) 
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estimator in (Fan and Li 2001 1 and re-weighted versions of £i like the adaptive Lasso (Zou 2006 1. 
We now study the tradeoffs between sparsity and shrinking for HLasso/HGLasso. By way of in- 
troduction to the more general analysis in the next section, we first compare the sparsity conditions 
for HGLasso and MKL (or, equivalently, GLasso) in a simple, yet instructive, two group example. 
In this example, it is straightforward to show that HGLasso guarantees a more favorable tradeoff 
between sparsity and shrinkage, in the sense that it induces greater sparsity with the same shrinkage 
(or, equivalently, for a given level of sparsity it guarantees less shrinkage). 
Consider two groups of dimension 1, i.e. 



y- 



G(1)0(1)+g(2)0(2)+, 



(39) 



where G*^) = [1 5]^, G^^) = [0 1]^, v ~ ^{0,0^). Assume 0^^) = 0, 0^) = i Qur goal is to 
understand how the hyperparameter / influences sparsity and the estimates of 6^^^ and 0^^' using 
HGLasso and MKL. In particular, we would like to determine which values of 7 guarantee that 
qW = and how the estimator 6^'^'^ varies with 7. These questions can be answered by using the 
KKT conditions obtained in Propositions |4] and [6] 



Let J := \yi 3^2]^ and recall that i^(') := G^') (G^'') . By (|23|), the necessary conditions for Ai = 
and A2 > to be the hyperparameter estimators for the HGLasso estimator (for fixed 7) are 

2 



271 



HGL 



> 



+ 



O'- + ff2 



and 



max 



-l+^/l+Srj/GLyj ^2 



a^ 



(40) 



Similarly, by (38 1, the same conditions for MKL read as 



2Ymkl > 



and 



3 MKL 
A2 



max 



a2,0 



}■ 



(41) 



Note that it is always the case that the lower bound for Jmkl is strictly greater than the lower 
bound for yuGL and that X^*^^ ^ A|^^^ when Jugl = ^mkl, where the inequality is strict whenever 
\MKL y Q rpj^g corresponding estimators for ^'^^ and 0^^) are 



^HGL 



^MKL 







fl(2) . 
^HGL 



and 



fl(2) . 
^MKL 



(42) 



Hence, 1 0^^| < \§mkl\ whenever 3^2 7^ and X^^^ > 0. However, it is clear that the lower bounds 



a (2) 



hence d^j^^ = 0). Of course, having a larger 7 tends to yield smaller A2 and hence more shrinking 

on This is illustrated in figure|2] where we report the estimators O^q^ (solid) and 0^^^ (dotted) 
for (7^ = 0.005, 5 = 0.5. The estimators are arbitrarily set to zero for the values of 7 which do not 
yield 0^^) = 0. In particular from (|40]l and (|4T]) we find that HGLasso sets = for Yhgl > 5 
while MKL sets § 

a (2) 



on 7 in (40 1 and (41 1 indicate that Ymkl needs to be larger than Yhgl in order to set Xf^^^ = (and 



MKL 



for Ymkl > 20. In addition, it is clear that MKL tends to yield greater 



shrinkage on dj^j^^ (recall that = 1). 
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7 



Figure 2: Estimators 6^^) as a function of 7. The curves are plotted only for the values of 7 which 
yield also = (different for HGLasso (Yhgl > 5) and MKL (Ymkl > 20)). 



6. Mean Squared Error properties of Empirical Bayes Estimators 

In this Section we evaluate the performance of an estimator § using its Mean Squared Error (MSE) 
i.e. its expected quadratic loss 



tr 



E 



{e-e) {e-ef 



where 6 is the "true" but unknown value of G. When we speak about "Bayes estimators" we think 
of estimators of the form 0(A) := E [0 | A] computed using the probabilistic model Fig. [Tjwith 7 
fixed. 



6.1 Properties using "orthogonal" regressors 

We first derive the MSE formulas under the simplifying assumption of "orthogonal" regressors 
{G^G = nl) and show that the Empirical Bayes estimator converges to an "optimal" estimator 



in terms of its MSE. This fact has close connections to the so called "Stein" estimators (James 



and Stein 1961| l, ( Stein[ 1981 1, (Efron and Morris 1973 1. The same optimality properties are 
attained, asymptotically, when the columns of G are realizations of uncorrelated processes having 
the same variance. This is of interest in the system identification scenario considered in (|Chiuso and 



Pillonetto 2010a|b 2011 1 since it arises when one performs identification with i.i.d. white noises as 



inputs. We then consider the more general case of correlated regressors (see Section 6.2 1 and show 
that essentially the same holds for a weighted version of the MSE. 
In this section, it is convenient to introduce the following notation: 

E,[-]:=E[-|A,0 = 0] and Var,[-] := E[- 1 A, = 0]. 

We now report an expression for the MSE of the Bayes estimators 0(A) :=E[0|y,A] (proof follows 
from standard calculations and is therefore omitted). 
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Proposition 7 Consider the model ( 15 1 under the probabilistic model described in Fig. ^b ). The 
Mean Squared Error of the Bayes estimator 6{X) : = E A] given X and G = 6 is 



MSE{X) = tr Ev {e{X)-e){e{X)-e) 



tr 



-1 



.(43) 



We can now minimize the expression for MSE{X) given in (43 1 with respect to X to obtain the 



optimal minimum mean squared error estimator. In the case where G^ G = nl this computation is 
straightforward and is recorded in the following proposition. 

Corollary 8 Assume that G^G = nl in Proposition^ Then MSE( X ) is globally minimized by choos- 
ing 

i=l,...,p. (44) 

ki 



Xs — X 



opt 



Next consider the Maximum a Posteriori estimator of X again under the simplifying assumption 
G^G = nl. Note that, under the noninformative prior (7 = 0), this Maximum a Posteriori estimator 
reduces to the standard Maximum (marginal) Likelihood approach to estimating the prior distribu- 
tion of 6. Consequently, we continue to call the resulting procedure Empirical Bayes (a.k.a. Type-II 
Maximum Likelihood, ( [Berger 1985)). 



Proposition 9 Consider model (J_5 1 under the probabilistic model described in Fig. \l\b), and as- 
sume that G^G = nl. Then the estimator of A, obtained by maximizing the marginal posterior 

vim 

(45) 



{Ai(7),...,Ap(7)} :=argmaxp(A|3;) = argmax / p{y,d\X)py{X)de, 



is given by 



where 



A; (7) = max 



'47 



4a^7 



(46) 



y 



is the Least Squares estimator of the i—th block 7 — t- (7 = corresponds to an improper 

flat prior) the expression (|46l) yields: 



( 110^''^ IP CT^' 

limA/(7) = max 0, " 

7-^-0 \ ki n 



(47) 



In addition, the probability P[A,(7) = 0\ d = d] of setting A; = is given by 



P[A;(7)=O|0 = e] 



(48) 



where X'^{d,IJ-) denotes a noncentral random variable with d degrees of freedom and noncen- 
trality parameter /I. 
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Note that the expression of A/(7) in Proposition [9] has the form of a "saturation". In particular, 
for 7 = 0, we have 

||fl(0||2 2 

A,-(0) =max(0,I*), where i* := . (49) 

ki n 

The following proposition shows that the "unsaturated" estimator A* is an unbiased and consistent 
estimator of X"'^' which minimizes the Mean Squared Error while A;(0) is only asymptotically 
unbiased and consistent. 

Corollary 10 Under the assumption G^G = nl, the estimator of X* := in ( |49[ ) is an 

unbiased and mean square consistent estimator of X"^^ which minimizes the Mean Squared Error, 
while /L(0) := {Ai(0), ..,Ap(0)} is asymptotically unbiased and consistent, i.e.: 

E[A; I = 0] = A;^' limE[A;(O)|0 = 0]=Af^' (50) 

and 

lim A;'"=- a;''' lim Xm Af^' (51) 
where = denotes convergence in mean square. 

Remark 11 Note that if 6^'^ = the optimal value is zero. Hence (51 1 shows that asymptoti- 



cally Ai(0) converges to zero. However, in this case, it is easy to see from (48 1 that 

limP[A,(0) =O|0 = 0] < 1. 

There is in fact no contradiction between these two statements because one can easily show that for 
all £ > 0, 

P[A;(O)G[O,£)|0 = 0]'-^1. 

^ — 2 

In order to guarantee that lim„^txj P[A,(7) = 1 = 0] = 1 one must chose y=ynSO thatl^y,, — )• 00, 
so that Yn grows faster than n. This is in line with the well known requirements for Lasso to be 
model selection consistent. In fact, Theorem^shows that the link between y and the regularization 
parameter Yl for Lasso is given by Yl = \/27. The condition n^^Yn ~^ °° translates into n~^I^YLn — ^ 



00, a well known condition for Lasso to be model selection consistent {Zhao and Yu 2006' Bach 
20m. 



The results obtained so far suggest that the Empirical Bayes resulting from HGLasso has desir- 
able properties with respect to the MSE of the estimators. One wonders whether the same favorable 
properties aie inherited by MKL or, equivalently, by GLasso. The next proposition shows that this 
is not the case. In fact, for W 7^ 0, MKL does not yield consistent estimators for X-'^' ;in addition, 
for 0''' = 0, the probability of setting A,- (7) to zero (see equation ( [55] )) is much smaller than that 
obtained using HGLasso (see equation ([48])); this is also illustrated in Figure[3](top). Also note that, 
as illustrated in Figure [3] (bottom), when the "true" is equal to zero, MKL tends to give much 
larger values of A than those given by HGLasso. This results in larger values of || 1| (see Figure pjl. 
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Proposition 12 Consider model ( |15| ) under the probabilistic model described in Fig. \I^b), and 
assume G^G = nl. Then the estimator of Xi obtained by maximizing the joint posterior p(A,0|3') 



( see equations ( |32| ) and ( |33[ ) ), 



{A(7),...,Ap(7)} :=arg max p(A,(/>b), (52) 

is given by 

( 110^'^ II aA 
AK7)=max(0,^^- — j , (53) 

where 

is the Least Squares estimator of the i—th block 6^'^ for i = \ ,.. .,p. For « — )• oo the estimator A, (7) 
satisfies 

limAK7)'=^. (54) 



(55) 



In addition, the probability P[A,(7) = 1 = 0] of setting A,(7) = is given by 



Pe[A/(7)=O|0 = 0] 



r2 



Note that the hmit of the MKL estimators A, (7) as « — 00 depends on 7. Therefore, using 
MKL (GLASSO), one cannot hope to get consistent estimators of A"'". Indeed, for 7^ 0, 

consistency of A/ (7) requires 7— )• ^pfyyp. which is a circular requirement. 
6.2 Asymptotic properties using general regressors 

In this subsection, we replace the deterministic matrix G with Gn{(o), where Gn{(o) represents an 
nxm matrix defined on the complete probability space (n,^,P) with (O a generic element of Q. 
and =^ the sigma field of Borel regular measures. In particular, the rows of G„ are independenj^ 
realizations from a zero-mean random vector with positive definite covariance *F. We will also as- 
sume that the (mild) assumptions for the convergence in probability of G^Gn/n to *P, as n goes to 



00, are satisfied, see e.g. (Loeve 1963 1. 



As in the previous part of this section, A and B are seen as parameters, and the "true" value of is 0. 
Hence, all the randomness present in the next formulas comes only from G„ and the measurement 
noise. Below, the dependence of rv(A) on G„, and hence of n, is omitted to simplify the notation. 
Furthermore, — )-p denotes convergence in probability. 



Theorem 13 For known 7 and conditional on G = G, define 

r = arg min ^ logdet(r,(A)) + \y^ lL-\X)y yj^Xu (56) 



2. The independence assumption can be removed and replaced by mixing conditions. 
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Sparsity vs. Bias (Shrinking) 
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Figure 3: This plot has been generated assuming that there are two blocks {p = 2) of dimension 
ki=k2 = 10 with 0^^' = and all the components of the true 0^^) G M'" set to one. The 
matrix G equal to the identity, so that the output dimension {y G M") is n = 20; the noise 
variance equal to 0.1. Left: probability of setting 0^^) to zero vs Mean Squared Error 
m 

Curves are parametrized in 7 G [0,+oo). Right: Mean Squared Error in Q^^> vs 
Mean Squared Error in d'^'^\ Curves are parametrized in 7 G [0, +00). 



||g'')f 



where ^ is any p-dimensional ball with radius larger than max, — ^ 
Then, we have 



A" 



1" 



-h+^k}+^y\\m\\^ 
Ay 

. lie'''' IP 
k, 



A" 







if 7>0 and ||0(')||>O, 

if 7=0 and > 0, anJ 

if 7>0 and ||e(')||=0. 



(57) 

(58) 
(59) 



We now show that, when 7=0, the above result relates to the problem of minimizing the MSE 
of the /-th block with respect to A,, with all the other components of A coming from A". If dn\X) 
denotes the /-th component of the HGLasso estimate of 6 defined in (21 1, our aim is to optimize the 
objective 



MSEn{Xi):=tv 



(0i'^(A)-0«)(0«(A)-0 



with Xj = X'j for j / / 



where is A" is any sequence satisfying condition 



lim /„ = +00 where /„ := min nAf , 



(60) 



where l\ := {j : 77^ / and 0*^^' 7^ 0} (condition ( [6OI ) appears again in the Appendix as (|94]l). Note 
that, in particular. A" = A" in (|56]l satisfy ( |60l ) in probability. 

In the following lemma, whose proof is in the Appendix, we introduce a change of variables 
that is key for our understanding of the asymptotic properties of these more general regressors. 
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Lemma 14 Fix / G { 1 , . . . , and consider the decomposition 



(61) 



of the linear measurement model \\5\ and assume ( |88| ) holds. Define 



and assume that Xj finite \/j ^ i. Consider now the singular value decomposition 

^^'''^'^ Ulpl^iv!^ (62) 



where each = diag{d'f^^ is k, x ki diagonal matrix. Then (61 1 can be transformed into the 
equivalent linear model 

= D»Ai'^ + £i'\ (63) 

where 



and Dn'' is uniformly (inn) bounded and bounded away from zero. 



^1 —\\n)^ 



(64) 



Lemma [14] shows that we can consider the transformed linear model associated with the /-th 
block, i.e. 

\n — "-kflPkfl^ H,n^ K—L,...,K,, K^J) 

where all the three variables on the RHS depend on A" for j ^ i. In particular, the vector jSi'^ 

consists of an orthonormal transformation of 6^') while the d^^^ are all bounded below in probability. 
In addition, by letting 

lEv Y^)\ = - '"^'")'] = ^M' (66) 

we also know from Lemma 17 (see equations (|96]) and ( |97| )) that, provided A" (j ^ i) satisfy con- 
dition (60 1, both m^M and tend to zero (in probability) as n goes to oo. Then, after simple 

computations, one finds that the MSB relative to jSi'^ is the following random variable whose statis- 
tics depend on n: 

MSE„{Xi) = l^- , . .2 .2 With Xj = Xj for j^i. 

Above, except for A,, the dependence on the block number / was omitted to improve readability. 
Now, let A" denote the minimizer of the following weighted version of the MSEn{Xi): 

\n ■ ^ A^ ^^.n + n^f4.n ('"Li + " ^hdk,nmk,nh.n 

A; = arg min 2^ „ — ^ ,i m • 

Then, the following result holds. 
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Proposition 15 For 7 = and conditional on G = G, the following convergences in probability hold 



lim A," 



lim , / = 1,2, . . . ,p. 



(67) 



The proof follows arguments similar to those used in last part of the proof of Theorem 13 



see 



also proof of Theorem 6 in Aravkin et al. (2012 1, and is therefore omitted. 

We can summarize the two main findings reported in this subsection as follows. As the number 
of measurements go to infinity: 

1. regardless of the value of 7, the proposed estimator will correctly set to zero only those A; 
associated with null blocks; 



2. when 7 = 0, (58l and (59) provide the asymptotic properties of ARD, showing that the esti- 



mate of A, will converge to the energy of the i-th block (divided by its dimension). This same 
value also represents the asymptotic minimizer of a weighted version of the MSB relative to 
the i-th block. In particular, the weights change over time, being defined by singular values 
d^'l^, (raised at fourth power) that depend on the trajectories of the other components of A. 



6.2.1 Marginal likelihood and weighted MSE: perturbation analysis 

We now provide some additional insights on point 2 above, investigating why the weights d^^^ may 
lead to an effective strategy for hyperparameter estimation. 

For our purposes, just to simplify the notation, let us consider the case of a single ni-dimensional 



block. In this way, A becomes a scalar and the noise „ in ( 65 1 is zero-mean of variance l/n. 
Under the stated assumptions, the MSE weighted by d"^, with a an integer, becomes 



k=l 



{n-'+Xdl 



(68) 



whose partial derivative with respect to A, apart from the scale factor 2/?i, is 



(69) 



Let jS,t = lim„H^oo j8i_„ and dj^ = limn^oodj^,,^ When n tends to infinity, arguments similar to those 
introduced in the last part of the proof of Theorem 13 show that, in probabihty, the zero of F« 
becomes 

Lk=i<^k Pk 



Xia) 



k= 



-\"-k 



(70) 



Notice that the formula above is a generalization of the first equality in ( 67 1 that was obtained by 
setting a = 4. However, for practical purposes, the above expressions are not useful since the true 



3. We are assuming that both of the hmits exist. This holds under conditions ensuring that the SVD decomposition 
leading to l |65| l is unique, e.g. see the discussion in Section 4 of (Bauer 20051, and combining the convergence of 
sample covariances with a perturbation result for the Singular Value Decomposition of symmetric matrices (such as 



Theorem 1 in i Bauer 2005 i, see also ( ChateUn 1983 1) 
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values of Pk,n and j3yt depend on the unknown 6. One can then consider a noisy version of Fa 
obtained by replacing jS^ „ with its least squares estimate, i.e. 

where the random variable v,t „ is of unit variance. For large n, considering small additive pertur- 
bations around the model Zk = dk^k, it is easy to show that the minimizer tends to the following 
perturbed version of X : 

, l!k=l'^k^^PkV k,n 

It remains to choose the value of a that should enter the above formula. This is far from trivial 
since the optimal value (minimizing MSB) depends on the unknown jSyt. On one hand, it would 
seem advantageous to have a close to zero. In fact, a = relates X to the minimization of the MSB 
on 6 while a = 2 minimizes the MSB on the output prediction, see the discussion in Section 4 of 



X{a) + 2 ^'-^ll Z^T - (72) 



Aravkin et al. (2012 ). On the other hand, a larger value for a could help in controlling the additive 



perturbation term in (72) possibly reducing its sensitivity to small values of dk- For instance, the 
choice a = introduces in the numerator of (72i the term Pk/d^. This can make numerically 
unstable the convergence towards X, leading to poor estimates of the regularization parameters, 
as e.g. described via simulation studies in Section 5 of Aravkin et al. ( 2012| ). In this regard, the 
choice a = 4 appears interesting: it sets X to the energy of the block divided by m, removing the 
dependence of the denominator in ( [72] ) on dk- In particular, it reduces (72 ) to 

11^8 f ^ 2 l5kVk,n _ f Pi ( ^ ,3 Vk,n \ ^73^ 

m y/ndk m \ jiky/ndk J ' 

It is thus apparent that a = 4 makes the perturbation on ^ dependent on p^^^^ that is the rela- 
tive reconstruction error on pi^. This appears a reasonable choice to account for the ill-conditioning 
possibly affecting least-squares. 

Interestingly, for large n, this same philosophy is followed by the marginal likelihood procedure 
for hyperparameter estimation up to first-order approximations. In fact, under the stated assump- 
tions, apart from constants, the minus two log of the marginal likelihood is 

£log(.-'+A4j + ^_i^^'" , ' (74) 



k=\ ~^^^kM 



whose partial derivative w.r.t. X is 



^k,n^^ ^^k,n ^,n^k,n 



As before, we consider small perturbations around Zk = dkPk to find that a critical point occurs at 



(75) 



which is exactly the same minimizer reported in ([73]). 
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7. Three variants of HGLasso and their implementation 

In this section we discuss the implementation of our HGLasso approach. In particular, the results 
introduced in the previous section point out some distinctive features of HGLasso with respect to 
GLasso (MKL). In fact, we have shown that HGLasso relies upon an estimator for the hyperparam- 
eter A having some favorable properties in terms of MSB minimization. In addition, sparsity can be 
induced using a smaller value for 7 than that needed by GLasso. This is an important point since 
we have seen that nice MSB properties are obtained optimizing the marginal posterior of A with 7 
close to zero. 

On the other hand, a drawback of the HGLasso is that it requires the solution to a non-convex 
optimization problem in a possibly high-dimensional space. We show that this problem can be 
faced by introducing a variant of HGLasso where only one scalar variable is involved in the opti- 
mization process. In addition, this procedure is able to promote greater sparsity than the full version 
of HGLasso while continuing to accurately reconstruct the nonzero blocks. This new computational 
scheme, which we call HGLa, relies on the combination of marginal likelihood optimization and 
Bayesian forward selection equipped with cross validation to select 7. It is introduced in the fol- 
lowing subsections together with two other versions that will be called HGLb and HGLc. The 
following two subsections are instrumental to the introduction of the three algorithms. 



7.1 Bayesian Forward Selection 

In this section we introduce a forward-selection procedure which will be useful to define the com- 
putationally efficient version of the HGLasso estimator. Hereafter, we use and jva/ to indicate the 
output data contained in the training and validation data set, respectively. This also induces a natu- 
ral partition of G into Gu and Gy,a\- In order to obtain an estimator of X we consider the constraint 
= Ai = A2 = . . . = Ap and treat as a deterministic hyperparameter whose knowledge makes the 
covariance Zy,, of the data y^r completely known. Therefore we set: 

arg min ^ log det(Iy„ ) + \yl^'ylytr {11) 



Now, we consider again the Bayesian model in Big. [TJb), where all the components of A are fixed 
to K" while 7 may vary on a grid C built around k^^ . The forward-selection procedure is then 
designed as follows; for each value of 7 in the grid C let / C {1,2, ..,/?} be the subset of currently 
selected groups and, using the Bayesian model in Big. [TJb) (see also ([6])-([8]l), define the marginal 
log posterior 



L{I,k,y) :=log PyiXilytr 



(78) 



A/ := [A/,1, A/,p] and A/,, = kif i £ I and A/ ,- = otherwise. Then, for each value of 7 in the grid 
C, perform the following procedure: 



initiaUze 7(7) : = 

repeat the following procedure: 

(a) for j € {1,..,p}\I{y), define 7^(7) :=7(7) U j and compute L (7^(7); k,Y). 

(b) select 

J:= argmax L{l]{y)\k,y)-L{I{y);k,y) 
ie{i,..,ri\/(7) 
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(c) ifL{i'j{ry,k,r)-L{i{ry,k,Y)>o 

set 7(7) := i'j{y) and go back to (a) 

else 

finish. 



Note tliat, for eacli 7 in tiie grid, the set 7(7) contains the indexes of selected variables different from 
zero. Let 7 denote the value of 7 from the grid that yields the best prediction on the validation data 
set y^ai, i.e. 

7 = arg min \\yyai - G^ai Ohgl {\(y) ) \ \ 

and set Ifs = iif)- 



7.2 Projected Quasi-Newton Method 



The objective in ( 19 1 is a differentiable function of A. The computation of its derivative requires a 



one time evaluation of the matrices G^'^G''^ , i = I,. . . ,p. However, for each new value of A, the 
inverse of the matrix ^^(A) also needs to be computed. Hence, the evaluation of the objective and 
its derivative may be costly since it requires computing the inverse of a possibly large matrix as well 
as large matrix products. On the other hand, the dimension of the parameter vector A can be small, 
and projection onto the feasible set is trivial. 

We experimented with several methods available in the Matlab package minConf to optimize 



( 19 1. In these experiments, the fastest method was the limited memory projected quasi-Newton 



algorithm detailed in (Schmidt et al. 2009 1. It uses L-BFGS updates to build a diagonal plus low- 
rank quadratic approximation to the function, and then uses the Projected Quasi-Newton Method to 
minimize a quadratic approximation subject to the original constraints to obtain a search direction. 
A backtracking line search is applied to this direction terminating at a step-size satisfying a Armijo- 
like sufficient decrease condition. The efficiency of the method derives in part from the simplicity 
of the projections onto the feasible region. We have also implemented the re-weighted method 
described in ( |Wipf and Nagarajan 2007| l. In all the numerical experiments described below, we 



have assessed that it returns results virtually identical to those achieved by our method, with a 
similar computational effort. It is worth recalling that both the projected quasi-Newton method and 
the re-weighted approach guarantee only converge to a stationary point of the objective. 



7.3 The three variants of HGLasso 

We consider the three version of HGLasso. 



HGLa: Output data y are split in a training and validation data set. The optimization problem 
(77 1 is solved using only the training data obtaining k. The regularization parameter 7 is es- 
timated using the forward- selection procedure described in the previous subsection equipped 
with cross-validation. This procedure also returns the set Ifs containing the indexes of the se- 
lected variables different from zeros. This index set gives an estimate A/75 of the hyperparam- 
eter vector, whose components are equal to k for / G Ips, and zero otherwise . Finally, 6hgl 
is estimated by the formula given in (21 1, E(0 | y, Xps) = blockdiag((AFs);7j..)G^ry(Af 5)^^^, 
using all the available data, i.e. the union of the training and validation data sets. 
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Figure 4: Comparison with MKL/GLasso (section |8.1| ). Boxplot of the percentage errors in the 
reconstruction of 6 (top) obtained by the 4 estimators after the 300 Monte Carlo runs in 
Experiment #1 (top panel) and #2 (bottom panel). 



HGLb: The optimization problem ( 19 1 is solved using the Projected Quasi-Newton method 
with starting point defined by the Xfs returned by HGLa. The regularization parameter 7 is 
set to the estimate obtained by HGLa, 7. Once the new estimate of A is obtained, Ohgl is 



computed using (21 1. 



HGLc: This estimator performs the same operations as HGLb except that the components 
of A set to zero by HGLa are kept at zero, i.e. A, =0, i ^ Ips- In addition, the regularization 
parameter 7 is set to zero in order to obtain the MSE properties in the reconstruction of the 



blocks different from zero established in Proposition 15 Hence, the problem ( 19 1 is optimized 
with 7 = and only over those A; for i G Ips- 



8. Numerical experiments 
8.1 Simulated data 



We consider two Monte Carlo studies of 300 runs each on the linear model (15]) with p = 10 groups, 
each composed of kj = 4 parameters, and n = 100. For each run, 5 of the 0'^^roups are set to zero, 
one is always taken different from zero while each of the remaining 4 6^'^ groups are set to zero 
with probability 0.5. The components of every 0''' block not set to zero are independent realizations 
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from a uniform distribution on [—a, , a,] where a, is an independent realization (one for each block) 
from a uniform distribution on [0, 100]. The value of is equal to the variance of the noiseless 
output divided by 25. The noise variance is assumed unknown and its estimate is determined at each 
run as the sum of the residuals coming from the least squares estimate divided hy n — m. The two 
experiments differ in the way the columns of G are generated at each run. In the first experiment, the 
entries of G are independent realizations of zero mean unit variance Gaussian noise. In the second 
experiment the columns of G are correlated, being defined at every run by 

Gi.j = Gi.j-\+0.2vi.j-\, i = \,..,n, j = 2,..,m 
Vij ~ ^(0, 1) 

where Vij are i.i.d. (as / and j vary) zero mean unit variance Gaussian and G,;i are i.i.d. zero mean 
unit variance Gaussian random variables. Note that correlated inputs renders the estimation problem 
more challenging. 

We compare the performance of the following 4 estimators. 

• HGLa,HGLb,HGLc: These are the three variants of our HGLasso procedure defined at the 
end of Section [7] The data is split into training and validation data sets of equal size and the 
grid C used by the cross validation procedure to select y contains 30 elements logarithmically 
distributed between 10"^ x k'"' and 10"^ x jc"^ 

• MKL (GLasso): The regularization parameter is determined via cross validation, splitting 
the data set in two segments of the same size and testing a finite number of parameters from a 
grid with 30 elements logarithmically distributed between 10^^ x f and 10^ x 7 where 7 is the 
regularization parameter adopted by the three HLasso procedures. Finally, MKL (GLasso) is 
reapplied to the full data set fixing the regularization parameter to its estimate. 

The 4 estimators are compared using the two performance indexes listed below: 

1 . Percentage estimation error: this is computed at each run as 

100 X -^i^T^^ % (79) 
ll^ll 

where 6 is the estimate of 6. 

2. Percentage of the blocks equal to zero correctly set to zero by the estimator after the 300 runs. 

The top and bottom panel of Fig. [4] displays the boxplots of the 300 percentage errors obtained by 
the 4 estimators in the first and second experiment, respectively. It is apparent that all of the three 
versions of HGLasso outperform GLasso. 

In Table [T] we report the sparsity index. One can see that in the first and second experiment 
the first and third version of HGLasso obtain the remarkable performance of around 99% of blocks 
correctly set to zero, while the second version obtains a value close to 76%. Instead, in the two 
experiments GLasso (MKL) correctly set to zero no more than 40% of the blocks. This result, 
which can appear surprising, is explained by the arguments in Sections [5] and |6| in a nutshell, 
GLasso trades sparsity for shrinkage. The value of the regularization parameter 7 needed to avoid 
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HGLa 


HGLb 


HGLc 


MKL (GLasso) 


Experiment #1 


99.2% 


76.1% 


99.2% 


36.1% 


Experiment #2 


99.0% 


76.5% 


99.0% 


39.5% 



Table 1: Comparison with MKL/GLasso (section 8. 1 1. Percentage of the 0^') equal to zero correctly 
set to zero by the four estimators. 



oversmoothing is not sufficiently large to induce "enough" sparsity. This drawback does not affect 
our new nonconvex estimators. These estimators have the additional advantage of selecting the 
regularization parameters leading to more favorable MSE properties for the reconstruction of the 
non zero blocks, as discussed in Section[6]and illustrated in Section[5]in a simplified scenario. 

8.1.1 Testing A VARIANT OF HGLa 

To better point out the role played by 7 in our numerical schemes, we have also considered a variant 
of HGLa where 7 is always set to and the parameter is used to induce sparsity. More precisely, 
the only difference with respect to HGLa is that, after obtaining from least squares and deter- 
mining k, is re-estimated using the forward-selection procedure equipped with cross-validation. 
For the sake of comparison, we have considered 3 Monte Carlo studies of 300 runs. Data are gen- 
erated as in the second experiment described above except that the 3 cases exploit different values 
of (7^ equal to the noiseless output variance divided by 5 (case a), 2 (case b) or 1 (case c). Table [2] 
reports the mean of the 300 percentage errors while Table [3] reports the sparsity index. 
It is apparent that the variant of HGLa performs quite well, but HGLa outperforms it in all the ex- 
periments. These results can be given the following interpretation. When one adopts HGLa, the 
"high level" of the Bayesian network depicted in Fig. [l](b) (represented by 7) is used to to induce 
sparsity with the parameters entering the lower level of the Bayesian network that need not to be 
changed. Thus, 6 is eventually reconstructed adopting the estimated from data and the A deter- 
mined by marginal likelihood optimization, thus possibly exploiting the MSE properties reported in 



Proposition 15 When 7 is instead set to 0, the estimator has to trade sparsity and shrinkage using a 
less flexible structure. In particular, the lower level of the Bayesian network is now also in charge of 
enforcing sparsity and this can be done only increasing a^, possibly loosing performance in terms 
of MSE. 

8.2 Comparison with Adaptive Lasso 

In this section we compare the performance of HGLa with that obtainable by the Adaptive Lasso 
(AdaLasso) procedure introduced in ( |Zou[|2006] ). In particular, we consider an example taken from 



(Zou 2OO61 where the components of 6 are {3, 1.5,0,0,2,0,0,0} and each component represents a 
group (block size is equal to 1). The rows of the design matrix G are independent realizations from 
a zero mean Gaussian vector, with (/,7)-entry of its covariance equal to jS''^^'. To be more specific, 
we consider 6 Monte Carlo studies, each of 200 runs, where at each run j8 is drawn uniformly from 
the open interval (0.5, 1). The 6 experiments then differ in the number of data used to reconstruct 6 
(20 or 60) and in the variance of the Gaussian measurement noise (a^=l, 9 or 16). We implemented 
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HGLa 


Variant of HGLa 


Experiment #2 (case a) 


12.2% 


15.6% 


Experiment #2 (case b) 


30.1% 


39.2% 


Experiment #2 (case c) 


61.5% 


73.8% 



Table 2: Comparison with the variant of HGLa (section [8. 1.1 1. Mean of the percentage errors in 



the reconstruction of 6 obtained by HGLa and by the variant of HGLa where sparsity is 
induced by a^. 



HGLa Variant of HGLa 
Experiment #2 (case a) 99.2% 93.1% 
Experiment #2 (case b) 96.2% 86.5% 
Experiment #2 (case c) 88.1% 71.4% 



. Percentage of the 0^'' equal to zero 



Table 3: Comparison with the variant of HGLa (section 8.1.1 

correctly set to zero by HGLa and by the variant of HGLa where sparsity is induced by 

^2 



HGLa and Lasso as described in the previous subsection. For what regards AdaLasso, as in ( Zou 



2006 1 we exploited two-dimensional cross validation to estimate the regularization parameter and 



the variable rj defining the weights. In particular, the latter were set to the inverse of the absolute 
value of the least squares estimates raised at rj, where Tj may vary on the grid [0.5, 1, ... ,4]. 
Results are summarized in Table|4] that reports the mean of the 200 percentage errors, and in Table[5 
where the sparsity index is displayed. One can notice that HGLa outperforms Lasso and AdaLasscT 
achieving both a smaller reconstruction error and a better sparsity index. To further illustrate this 
fact, at each Monte Carlo run we have also computed the Euclidean norm of the estimates of the 
null components of 6 returned by the three estimators, divided by the norm of the true 6. Since 
the number of null components of is 5 and the overall number of Monte Carlo runs is 1200, 6000 
values were stored. Fig. [5] plots them (as a function of the Monte Carlo run) as points when the 
estimated value is different from zero (no point is displayed if the corresponding value is zero). It 
is apparent that in this example HGLa correctly detects the null components of 6 more frequently 
than Lasso and AdaLasso, also providing a smaller reconstruction error when a component is not 
set to zero. 



8.3 Real data 

In order to test the algorithms on real data we have considered thermodynamic modeling of a small 
residential building. We placed sensors in two rooms of a small two-floor residential building of 

4. In this experiment AdaLasso enforces more sparsity than Lasso but leads to larger reconstruction errors on since it 
tends more frequently to set to zero also components of 6 that are not null. 
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HGLa 


Lasso 


AdaLasso 


Exp. #1 (n=20, 


= 1) 


30.2% 


34.5% 


38.2% 


Exp. #2 (n=60, 


= 1) 


12.1% 


15.3% 


17.1% 


Exp. #3 (n=20, 


= 9) 


68.5% 


81.2% 


100.1% 


Exp. #4 (n=60, 


= 9) 


43.1% 


46.3% 


56.6% 


Exp. #5 (n=20, 


= 16) 


78.7% 


110.1% 


141.2% 


Exp. #6 (n=60, 


= 16) 


53.7% 


60.0% 


73.6% 



Table 4: Comparison with AdaLasso (section 8.2 1. Mean of the percentage errors in the reconstruc- 
tion of Q obtained by the five estimators. 









HGLa 


Lasso 


AdaLasso 


Exp. #1 (n: 


=20, 


= 1) 


87.9% 


38.5% 


69.1% 


Exp. #2 (n: 


=60, 


= 1) 


96.5% 


45.6% 


75.2% 


Exp. #3 (n: 


=20, 


= 9) 


80.9% 


49.6% 


61.0% 


Exp. #4 (n: 


=60, 


= 9) 


87.4% 


46.8% 


69.4% 


Exp. #5 (n= 


20, a2 : 


= 16) 


78.4% 


51.4% 


59.3% 


Exp. #6 (n= 


60, a2 : 


= 16) 


85.7% 


41.1% 


65.5% 



Table 5: Comparison with AdaLasso (section 8.2 1. Percentage of the components of Q equal to zero 
correctly set to zero by the five estimators. 
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Figure 5: Comparison with AdaLasso (section 8.2 1. Euclidean norm of the estimates of the null 
components of d returned by the three estimators (divided by the norm of the true 0) as 
a function of the Monte Carlo runs performed in the 6 experiments. The values different 
from zero are displayed as points while no point is displayed if the obtained estimate is 
zero. 
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141 



Figure 6: Nodes location: 8 nodes each equipped with 3 sensors: temperature, humidity and total 
radiation. 



about 80 m^ and 200 m^ ; the sensors have been placed only on one floor (approximately 40 m^ ) 
and their location is approximately shown in Figure [6] The larger room is the living room while the 
smaller is the kitchen. The experimental data was collected through a WSN made of 8 Tmote-Sky 
nodes produced by Moteiv Inc. Each Tmote-Sky is provided with a temperature sensor, a humidity 
sensor, and a total solar radiation photoreceptor (visible + infrared). The building was inhabited 
during the measurement period, which lasted for 8 days starting from February 24th, 201 1 ; samples 
were taken every 5 minutes. The heating systems was controlled by a thermostat; the reference 
temperature was manually set every day depending upon occupancy and other needs. 
The location of the sensors was as follows: 

• Node #1 (label 111 in Figure |6]l was above a sideboard, about 1.8 meters high, located close 
to thermoconvector. 

• Node #2 (label 137 in Figure[6]) was above a cabinet (2.5 meters high). 

• Node #3 (label 139 in Figure[6]) was above a cabinet (2.5 meters high). 

• Node #4 (label 140 in Figure[6]) was placed on a bookshelf (1.5 meters high). 

• Node #5 (label 141 in Figure [6]) was placed outside. 

• Node #6 (label 153 in Figure[6]) was placed above the stove (2 meters high). 
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Figure 7: Measured temperatures (left) and humidity (right), first 40 hours. 



• Node #7 (label 156 in Figure [6]) was placed in the middle of the room, hanging from the 
ceiling (about 2 meters high). 

• Node #8 (label 160 in Figure [6]) was placed above one radiator and was meant to provide a 
proxy of water temperature in the heating systems. 

This gives a total of 24 sensors (8 temperature + 8 humidity + 8 radiation signals). A preliminary 
inspection of the measured signals (see Figure|7]) reveals the high level of coUinearity which is well- 
known to complicate the estimation process in System Identification ( Soderstrom and Stoica 1989| 
qungl[T999HBox etal.j ). 



We only consider Multiple Input-Single Output (MISO) models, with the temperature from each 
of the nodes as output (yt) and all the other signals (7 temperatures, 8 humidities, 8 radiations) as 
inputs (wj, / = 1, ..,23|^. We leave identification of a full Multiple Input-Multiple Output (MIMO) 
model for future investigation. We split the available data into 2 parts; the first, composed of N^i = 
700 data points, is used for learning and validation and the second, composed of Ntest = 1500 data 
points, is used for test purposes. The notation y"^ identifies the training and validation data while 
ytest identifies the test data. Note that Nij = 700, with 5 minute sampling times, corresponds to ~ 
58 hours; this is a rather small time interval and, as such, models based on these data cannot capture 
seasonal variations. Consequently, in our experiments we assume a "stationary" environment and 
normalize the data so as to have zero mean and unit variance before identification is performed. 

Our two main goals are as follows. 

I. Provide meaningful models with as small data set as possible. This has clear advantages if 
identification is being performed for, e.g., certification purposes or as a preliminary step for 
deciding, having monitoring/control objectives in mind, how many sensors are needed and 
where these should be installed. 



5. Even though one might argue that inside radiation does not play a role, we prefer not to embed this knowledge in 
order to make identification more challenging. After all, even though our experimental setup has a small number 
of sensors, a full scale monitoring system for a large building may have hundreds of sensors; in this scenario input 
selection is in our opinion a major issue. 
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2. Provide sensor selection rules in order to reduce the number of sensors needed to effectively 
monitor the environment. A setup we have in mind is the following: one first deploys a large 
number of nodes, collects data and performs identification experiments. As an outcome, in 
addition to the models, we identify a subset of sensors that are sufficient to effectively monitor 
the environment; based on the measurements from this subset of sensors one can then reliably 
"predict" the evolution of temperature (and possibly humidity) across the building. 



We envision that model predictive based methodologies, (see (Camacho and Bordons 2004 1 and 
the recent papers ( Yudong et al.[ 2010 1, (Prvara et al. 2011| l, ( Dong et aL| 2008| )), may be effective 
for these applications and, as such, we evaluate our models based on their ability to predict future 
data. The predictive power of the model is measured for ^-step-ahead prediction on validation data, 
as: 



CODt := 1 



t=k yyt 

Latest {,,te 
t=k yyt 



testXl 



(80) 



LNtesl ..test 
t=\ yt 



where y := tt 

We consider AutoRegressive models with exogenous inputs (ARX) ( |Ljung[ 1999; Soderstrom 
and Stoical |1989t [Box et al.) of the form 



23 q 



yt 



Y^hkAyt-k + Y^Y^h,i+\-u\_,^ + et . 



=u=i 



This model is linear in the parameters (htj, k = I, ..,q, i = 1,..,24) and as such falls within the 
general structure ([T]). Experiments with different lengths q were investigated. We report only the 
results for q = 20 which seemed to be the most reasonable choice for all methods. Note that, with 
reference to ( [T5] ), here we have p = 24 groups of ki = q = 20 parameters each, for a total of 480 
parameters. 

We compare the performance of the following 2 estimator: 

• HGLc: the variant of HGLasso procedure defined at the end of Section |7] Identification data 
are split in a training and validation data set of equal size and the grid C built around k^^ 
used by the cross validation procedure to select 7 turns out to be [25 : 25 : 1000]. 

• MKL (GLasso): the regularization parameter is estimated by cross validation using the same 
grid C adopted for HGLc. 



The 2 estimators are compared using as performance indexes the CODk defined in equation ([80]). 
Sample trajectories of one-step-ahead prediction on both identification and test data are displayed 
in Figures [8] and [9] while 5-hours (= 60 steps) ahead prediction is shown in Figure 10 



The CODjt up to 16 hours ahead for ARX models are plotted in Figures 11 while Figure 12 
shows the norm of the estimated impulse responses hk^, i = 1, ..,24. 

It is clear that HGLc performs better than MKL in terms of prediction while achieving a higher 
level of sparsity. Note also that the higher sparsity achieved by HGLc results in a much more stable 
behavior in terms of multi-step prediction (see Figures 10 and[TT]). These results are in line with the 
theoretical findings in the paper as well as with the simulation results on synthetic data on Section 



6. We do not report the performance of GLasso which was similar (if not worse) than MKL, in line with the synthetic 



experiments in Section 8.1 
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Training and validation data 
-HGLc 




30 
Hours 



Training and validation data 
-IVIKL 




30 
Hours 



Figure 8: ARX model: training and validation data (•) with output (solid line) estimated by HGLc 
(top) and MKL (bottom). 
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Figure 9: ARX model: test data (•) and prediction (solid line) obtained by HGLc (top) and MKL 
(bottom). 
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Figure 10: ARX model: test data (•) and 5-hours ahead prediction (solid line) obtained by HGLc 
(top) and MKL (bottom). 
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Figure 1 1 : ARX model: coefficient of determination as a function of the prediction horizon (one 
step = 5 minutes) using HGLc (o) and MKL (*). 
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Figure 12: ARX model: norm of estimated impulse responses using HGLc and MKL. 
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9. Conclusions 

We have presented a comparative study of two methods for sparse estimation: GLasso (equivalently, 
MKL) and the new HGLasso. They derive from the same Bayesian model, yet in a different way. 
The pecuUarities of HGLasso can be summarized as follows: 

• in comparison with GLasso, HGLasso derives from a marginalized joint density with the 
resulting estimator involving optimization of a non-convex objective; 

• the non-convex nature allows HGLasso to achieve higher levels of sparsity than GLasso with- 
out introducing too much regularization in the estimation process; 

• the MSB analysis reported in this paper reveals the superior performance of HGLasso also in 
the reconstruction of the parameter groups different from zero. Remarkably, our analysis elu- 
cidates this issue showing the robustness of the empirical Bayes procedure, based on marginal 
likelihood optimization, independently of the correctness of the priors entering the stochastic 
model underlying HGLasso. It also clarifies the asymptotic properties of ARD; 

• the non-convex nature of HGLasso is not a limitation for its practical application. Indeed, the 
Bayesian Forward Selection used in HGLa provides a highly successful initialization proce- 
dure for the regularization parameter 7 (7), an initial estimate for A (using k), and an initial 
estimate of the non-zero groups (Ifs)- This procedure requires only the solution of a one 



dimensional version of the basic problem ( 19 1 



Notice also that, being included in the framework of the Type II Bayesian estimators, many vari- 
ations of HGLasso could be considered, adopting different prior models for A . In this paper, the 
exponential prior has been used since the goal was the comparison of different estimators that can 
be derived from the same Bayesian model underlying GLasso. In this way, it has been also shown 
how, starting from the same stochastic framework, an estimator derived from a suitable posterior 
marginalization can have signicant advantages over another one derived from posterior optimiza- 
tion. 

All theoretical findings have been confirmed by experiments involving real and simulated data, also 
comparing the performance of the new approach with adaptive lasso. The aforementioned version 
of HGLasso has been able to promote sparsity correctly detecting a high percentage (in some ex- 
periments also equal to 99%) of the null blocks of the parameter vector and to provide accurate 
estimates of the non-null blocks. 



10. Appendix 

10.1 Proof of Proposition |5] 

Given 7 > 0, the maximum a posteriori estimate for (0 , A ) given y is obtained by solving the problem 

Minimizing first in <p allows us to write ^ as the following function of A : 

0(A) := {a^I + A^/^G'^GA^/^)-^A^/^G'^y = A^^^G'^{a^I + K{l))-^y, (82) 
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where the second expression follows from the Matrix Inversion Lemma. Substituting this back into 
( 81 1 yields the optimization problem ( 34 1. Hence, if X is as defined in ( 34 1, then ( 36 1 follows from 



Now, we show that the pair (c, A ) described by ( 34 1 and ( 36 1 solves ( 3 1 1 for some value of 7 > 0. 



For this, we need only show that the pair (c, A) is a local solution to (31 1 for some 7 > 0. To this 



end, observe that ( 29 1 is coercive in / (the objective goes to 00 as the norm of / goes to 00) and A is 



constrained to stay in a compact set, so that a solution exists. Consequently a solution to (31 1 exists. 



Let 7 > be the Lagrange multiplier associated with a local solution (c*, A*) to (31 1 (7 exists since 



the constraint is linear). We show that (c*, A*) = (c, A). Since 7is the Lagrange multiplier, (c*, A* 
is a local solution to the problem 



mm 



{y-K{X)cy{y-K{X)c) 
a2 



(83) 



As above, first optimize ( 83 1 in c to obtain 

c{X) = {a^I + K{l))-'y. 



Plugging this back into the objective in ( [83] ) gives the objective 

y^{o^l + K{X))-'y + Y\'X 



which establishes (34i and (35l for (c*,A*), and hence the pair (c*,A*) satisfies (34l and (36l by 
the first part of the proof and solves ( [3T] ) by definition. Finally, ( 37 1 can be obtained reformulating 
the objective ( 81 1 in terms of d = A^i^ and A (in place of and A), and then minimizing it first in 
A. 

10.2 Proof of Proposition [9] 

Under the simplifying assumption G = nl, one can use ([ojl to simplify the necessary conditions 
for optimality in ([23]). By ([9]l, we have 

1 



nXi + 



G 



(0 



and so 



tr(G»Tv7>G» 



nki 



^Xi + o^ 



and ||G«Tr-Vlli 



|0(O„2 



Inserting these expressions into ([23]) with jii = yields a quadratic equation in A,- which always has 
two real solutions. One is always negative while the other, given by 



1 

47 



is non-negative provided 



'? + 8rlieSlP-(*, + ^) 



n 



1 + 



nki 



(84) 
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This concludes the proof of (46 1. The Umiting behavior for 7 — can be easily verified, yielding 



/ ||0(')||2 2\ 

A,(0) = max(0,^^- — J / = !,.., p. 

Also note that 0^5 = ^ (G^'^)^ j and (G^'^)^ G^'^ = nh, while (G^'^)^ G^^^ = 0, V; / i- This implies 
that d^ll ~ ^4,.). Therefore 



This, together with ( 84 1, proves also (48 1. 



10.3 Proof of Proposition [To] 

In the proof of Proposition 9 it was shown that WQ^l follows a noncentral distribution with 



ki degrees of freedom and noncentrality parameter Hence, it is a simple calculation to 

show that 

E[A;|0 = 0] = ^X Var[A;|e = e] = -^+ " " . (85) 
ki kjU^ kfn 

By Corollary [sj the first of these equations shows that IE[A;* \ d = 6] = X"^\ In addition, since 
Var{A;*} goes to zero as « — )• 00, A* converges in mean square (and hence in probability) to 
As for the analysis of A;(0), observe that 

E[AK0) 1 = e] = e[a; \q = q\- £^ (^^^ - dP{\\dSf \ e = e) 

where dP{\\eHf | = 0) is the measure induced by ||0^^^|p. The second term in this expression 
can be bounded by 

where the last term on the right hand side goes to zero as n — )• 00. This proves that A, (0) is asymptoti- 
cally unbiased. As for consistency, it is sufficient to observe that Var[/l/(0) | = 0] < Var[A* | = 0] 
since "saturation" reduces variance. Consequently, Xi{0) converges in mean square to its mean, 
which asymptotically is X"^' as shown above. This concludes the proof. 

10.4 Proof of Proposition [12] 

Following the same arguments as in the proof of Proposition [o] under the assumption G^G = nl we 
have that 

2 



|r-(')Tv-U,||2 _ ( " \ ||fl(0||2 
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Inserting this expression into ( |38| ) with ^u, = 0, one obtains a quadratic equation in A, which has 
always two real solutions. One is always negative while the other, given by 

11^211 



is non-negative provided 



This concludes the proof of (53 1. 



The limiting behavior for « — )• oo in equation (|54]l is easily verified with arguments similar to 
those in the proof of Proposition 10 As in the proof of Proposition 1 1 d'll | p ^ follows a noncentral 

X^{d,jx) distribution with d = ki and /i = 1 1 | p ^ , so that from ([86]) the probability of setting A; (7) 
to zero is as given in (|55|). 



10,5 Proof of Theorem [13 

Recalling model ( [T5| ), assume that G/n is bounded and bounded away from zero in probability, 
so that there exist constants 00 > Cmax > Cmin > with 

lim P[c„„„/ < G^G/n < c,naJ] = 1 , (87) 
SO as n increases, the probability that a particular reahzation G satisfies 

CmiJ <G^G/n<CmaxI (88) 

increases to 1 . We now characterize the behavior of key matrices used in the analysis. 
We first provide a technical lemma which will become useful in the sequel: 

Lemma 16 Assume ( |88| ) holds; then the following conditions hold 

(i) Consider an arbitrary subset I = [/(I), ... ,/(/?/)] of size pi to be any subset of the indices 
[I, . . . ,p\, so p < pi and define 

g(^) = [G«i))...G«P'))] , (89) 
obtained by taking the subset of blocks of columns ofG indexed by I. Then 

Cmin^ ^ ^ Cmax^ • (90) 

n 

( ii) Let /'^ be the complementary set of I in [I,. . . ,p], so that I'^DI = (Z) and lUl'^ = [I,. . . ,p]. The 
minimal angle dmin between the spaces 

:= colspan{G^''> /^/^, i G /} and := colspan{G^j^ : j G /''} 

satisfies: 

dmin > aCOS I W 1 - 1 > 
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Proof Result ( |90l ) is a direct consequence of (Horn and Johnson 1994[ Corollary 3.1.3). As far 
as condition (ii) is concerned we can proceed as follows: let Ui and Ujc be orthonormal matrices 
whose columns span '^^ and , so that there exist matrices Tj and Tjc so that 

where G^^'^' is defined analogously to G^'\ The minimal angle between '^^ and ^^'^ satisfies 

cos(0„„-„) = UjUic. 

Now observe that, up to a permutation of the columns which is irrelevant, Gj^/n = [?7/r/ ?7/iT/c], 
so that 



UJG/^=[T, uJUicTic] = [1 uJUic 



Tj 

Tjr 



Denoting with (J,nin{A) and (Jmax{A) the minimum and maximum singular values of a matrix A, it is 
a straightforward calculation to verify that the following chain of inequalities holds: 



c,„in = o,MG'G/n)<a^^,{Uj'G/^) = a^.^ (^[/ u/U, 



T, 
Tic 



2 

max 





" 7) 





( 








= <in{[l ^/^K])max(a,L(7»,ai,(r;0) 



Observe now that a^;„ ([/ UjUjc]) = 1 - cos2(0„„„) so that 

Cmin — (1 ''OS (0m;n))Cni 

and, therefore, 



COS^(0m;„) < 1 



from which the thesis follow. 



Proof of Lemma 14 ; Let us consider the Singular Value Decomposition (S VD) 



PSP^; 



where, by the assumption (88 1, using 



and lemma 16 the minimum singular value Omin{S) of 5 in ( |9T] ) satisfies 

{S) > c„„„min{Ay, j : Xj / 0}. 

Then the SVD of Zv = Ly=i j-^,- G'^'^ ^ Ay + o^I satisfies 



(91) 

min{Ay,j : Ay / 0} 
(92) 



[ P Pi 



(«5 + a2)-i 
a 2/ 
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so that I IE; 



1 



.-2 



Note now that 



n 



(0 



and therefore, using Lemma 16 



^ ||E,- 1 1 -y/c mcLX — \fCmax- 



16 



condition ([88]) implies that V^jZ? 



proving that D^'^ is bounded. In addition, again using Lemma 

(of suitable dimensions) s.t. = = 1, a^^-^^h >k,k= 1./I - cos2(0,„,„) > -e™ > q. This, 
using (62 1, guarantees that 



,(0 



g(o 



and therefore dI'' is bounded away from zero. It is then a matter of simple calculations to show that 
with the definitions (64) then ( [6T] ) can be rewritten in the equivalent form ( [63] ). □ 



Lemma 17 Assume, w.l.o.g., that the blocks of 6 have been reordered so that || 7^ 0, j = 1, 
ant/ II =0, 7 = ^+1, ..,mand that the spectrum ofG^G/n is bounded and bounded away fi 
zero in probability, so that 



rom 



limn^ooP[CmaxI > G^G/n > Cminl] = 1. 



(93) 



Denote 

h :={7G[1,^], 
h:={j(^[k+\,pl j^i} 

and assume also that the numbers A", which are here allowed to depend on n, are bounded and 
satisfy: 

(94) 



lim /„ = +00 where /„ := min nA^ 



Then, conditioned on G, ei in ( 64 1 and ( 63 1 can be decomposed as 



The following conditions hold: 



en 



--ms„{e) = Op 



/Tn 



Ve„ = Op 



(95) 



(96) 



so that En^ 1 6 converges to zero in probability (as n ^ oa). In addition 



= 0p 



(97) 
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If in addition^ 



Op{\) ■,j=l,..,k j^i 



(98) 



then 



(99) 



Proof Consider the Singular Value Decomposition 



(100) 



Using ( |94l ), there exist n so that, \/ n> nwe have < A" < M < oo, j £ Ii. Otherwise, we could find 
a subsequence n^ so that AJ* = and hence = 0, contradicting (94). Therefore, the matrix Pi 

in ( |100| ) is an orthonormal basis for the space := col span{G'^^^ / y/n : j £ /i}. Let also r, be such 
that G^j^ l^/n = PiTj, j £h. Note that by assumption ([87]l and lemma 



16 



(101) 



Consider now the Singular Value Decomposition 



[Pi Po 



Si 
So 



Pi 
P^ 



PiSiP^ 



+ 



A. 



(102) 



For future reference note that 3Tp^ : Pi = [ Pi Pq ] Tp^ . Now, from ( |62| ) we have that 



(103) 



Using (103 1 and defining 



P:=[Pi Po] S: 



Si 
So 



7. This is equivalent to say that the columns of G'-'', j = I, ..,k, j ^ i are asymptotically orthogonal to the columns of 
G('). 
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equation ( 64 1 can be rewritten as: 



(0 



Tj{i 



-1 



(0 







(GO) 


T 

-[ 


\fn 




(GO) 


T 




-[ 









2t\-1 



P{nS + oH) 

















V 


a 2/ 





















a 2/ 




.Pi . 







0(7) + . 



(0 



-1 



(0 



me„(e) 
[ P Pi 



{nS + a^iy^ 




a 2/ 



where the last equation defines me„ (6) and Vg^^, the noise 

PT 



is still a zero mean Gaussian noise with variance a^I and ^ = P\T^j^ provided j / /. Note that 
me„ does not depend on v and that EyVg,, = 0. Therefore /«e„(6) is the mean (when only noise v is 
averaged out) of e„. As far as the asymptotic behavior of me„(0) is concerned, it is convenient to 
preliminary observe that 



1 


" Pi Pi ■ 






. Plp^ . 





{nSi+0^1)-^PjPx 
inSo + o'-I)-^PjPi 



and that the second term on the right hand side can be rewritten as 

in[So]u+O^I)-'Pj^,Pi 

(n5o + a2/)-iPoTp, 



{niSohl + O^ir'Pj^Pi 



1 

n[So],n-L„r-k + O^I)-'P^,„_,^Pl 



(104) 



where [Sola is the / — th diagonal element of and Pq,, if the / — th column of Pq. Now, using 
equation ( |102 1 one obtains that 

n[So]n = PlPnSP^Poj = PI^ {PmSiP^ +nA) Po^i 

> P^PmSiP^Poj 

> 0,„in{nSi)Pfl/iP^Poj 
= 0,„in{nSi)\\PfliPif. 



With an argument similar to that used in ( [92] ), also 

{nSi)> c™„min{«A", j £ Ii} = c,tiinfn 



(105) 
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holds true; denoting ||/()^,.Pi || = g,,, the generic term on the right hand side of ( 104 1 satisfies 

< kmm{gn,{f„gn) 



^mm{^/Jngn,{^/Jngn) ^] 



(106) 



< 



for some positive constant k. Now, using lemma [141 D^'^ is bounded and bounded away from zero 
in probability, so that \\ = Op{\) and || (d,?^ h t„ oHhih^^ v(') 



matrix and 



Op{\). In addition y,)^ is an orthonormal 

Op{\). Last, using (fT05]) and (|93|), we have + a^)"' || =Op{l/n). Com- 

As far as the asymptotics 



bining these conditions with ( |101| ) and ( |106| ) we obtain the first of 
on Vp are concerned, it suffices to observe that 

wJvp/^/n = Op{\/^/n) if: ||w„|| = Op(l). 

The variance (w.r.t. noise v) Var^,{en} = Ev [^enVfT] satisfies 

so that, using the condition || = derived in Lemma [14] and the fact that Un^ has orthonor- 
mal columns, the condition Var^,{en} = Op (^) in ( |97] ) follows immediately. 
If in addition ([98]) holds then ( |101| ) becomes 

\\Tj\\ = Op{l/V^) j=l,...,k; j^k 



so that and extra ^/n appears at the denominator in the expression of mp{Q) yielding (99l. This 
concludes the proof. ■ 



Our next results will focus on the estimator 



We will show that when the hypotheses of 



Lemma [T4[ hold, estimator ( [T9[ ) satisfies the key hypothesis of Lemma[T7] We first take a close look 
at the objective ( [T9] ). 



\\eu)ixw I ||0O-)(A)f 



Lemma 18 Take objective ( [19] ) divided by n: 

gniX) = logaH ^ logdet(a-2r,(A)) + ^ 



Si 



Si 



S3 



(107) 



54 



55 



where d(X) = AG^Ly (see (|21|)j, kj is the size of the jth block, and dependence on n has been 
suppressed. For any minimizing sequence A", we have the following results: 
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1- -^p 6- 

2. 51,52, 53,54 — i>p 0. 

4. nXj -^p 00 for all j £ h. 
Proof First, note that < 5; for / e {1,2,3,4}. Next, 

55 = ^lb-lG^0(^')(A)+£G^-(0W(A)-0W(A)) f 



(108) 

The first term converges in probability to ^. Since v is independent of all G^, the middle term 
converges in probability to 0. The third term is the bias incurred unless 6 = 6. These facts imply 
that, V£ > 0, 

Ssmn))>l-e 



]imP 



1 



(109) 



Next, consider the particular sequence A" = -^j^- For this sequence, it is immediately clear that 

5,- for / G {2,3,4}. To show 5i 0, note that 'ZXiGjCj < max{A,}i:G,Gf , and that the 
nonzero eigenvalues of GG^ are the same as those of G^G. Therefore, we have 



•^1 - ^max{A}cma^) = Op 



i=\ 



Finally S5 -^p j by ( fT08] ), so in fact, Ve > 0, 



limP 



g„(A(«))---log(a2) 



< £ 



log(w) 
n 



1 . 



,0. 



(110) 



Since ( 110 ) holds for the deterministic sequence A„, any minimizing sequence A„ must satisfy, 
V£ >0, 



hmP 



gnC^{n))<^+log{o^)+e 



1 



(111) 



which, together with ( |109| ), implies ( |1 10| ) 

Claims 1,2,3 follow immediately. To prove claim 4, suppose that for a particular minimizing 
sequence X (n), we have nXj -f^p °° for 7 G /i . We can therefore find a subsequence where nk" < K, 
and since S2{X{n)) — )-p 0, we must have — s-p 0. But then there is a nonzero bias term 
in ( [T08| ), since in particular - = / 0, which contradicts the fact that X{n) 

was a minimizing sequence. ■ 
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Before we proceed, we review a useful characterization of convergence. While it can be stated 
for many types of convergence, we present it specifically for convergence in probability, since this 
is the version we will use. 

Remark 19 a" converges in probability to a (written a" — )-p a) if and only if every subsequence 
q"(J) of a" has a further subsequence a"^-'^*^^' with a"^-'^'^^^ — t-^ a. 

Proof If a" — s-p a, this means that for any £ > 0, 5 > there exists some n^ § such that for all 
" > «e,5, we have P{\a" — a\>£)<5. Clearly, if a" — )>p a, then for every subsequence 

a"^^) of a". We prove the other direction by contrapositive. 

Assume that a" -f^p a. That means precisely that there exist some e > 0, 5 > and a subse- 
quence <3"(^) so that P{\a — d^'^j^\ > e) > 5. Therefore the subsequence a"'^) cannot have further 
subsequences that converge to a in probabihty, since every term of a"^^-' stays e-far away from a 
with positive probability 5. ■ 

Remark[T9]plays a major role in the next lemma from which Theorem[T3] immediately comes. 



Lemma 20 Let X\ be arbitrary and consider the estimator ( |56| ) along X\ that, in view of ( 63 1, is 
given by: 



A," 



1 

are min - \ 



X + Wk.n 



■log(A+W<:_„) 



■yX 



(112) 



where wt^n •= 1/ ('^('^1 « )^) ^"'^ ^k.n = '^^k^n^i^l (^i n Suppose that the hypotheses of Lemma 
hold, so that Wk.n 0. Then by Lemma 18 we know Lemma 17 applies, so that v^.n — 0. Let 

|0(1)||2 



14 



XJ:- 



We have the following results: 



4y 



A, 



ki 



1. Xj < X\ for all 7 > 0, and limy_^o+ ^\ 



A, 



2. >0a?i<i7>0, we We Af - 

3. >Qandy = Q, we have X^ — rp 

4. ifd^^^ = 0, we have Xl — >p Ofor any value 7 > 0. 
Proof 

1. The reader can quickly check that ^Xj < 0, so Xj is decreasing in 7. The limit calculation 



follows immediately from L'Hopital's rule it is clear that lim^_^o+ Aj = Ai. 



We use the convergence characterization given in Remark 19 Pick any subsequence A"'^'' of 
X". Since {V„(y)} is bounded, by Bolzano- Weierstrass it must have a convergent subsequence 
V, where V satisfies V^V = / by continuity of the 2-norm. The first order optimality 



V. 



n{m) 

conditions for Aj" > are given by 

= /i(A,w,v,T7; 



+ 



+ 7. 



(113) 
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and we have/i(A,O,O,V^0(i)) =Oif and only if A = Af. Taking the derivative we find 

which is nonzero at for any 7, since the only zero is at 2^^^^^ = 2Ai > 2Xj. 

Applying the Implicit Function Theorem to / at (^Xj ,0,0,V^ 6^^^) yields the existence of 
neighborhoods ^ of (0,0,^^0(1)) and W of Xj such that 

/(0(W,V,T]),W,V,T]) =0 V(W,V,T]) e . 

In particular, (/)(0,0, V^e^^)) = Xj. Since iw„(j(t)),v„(^j{k)),Tln{j{k))) -^p (0,0, V^e^^'), we 
have that for any 5 > there exist some kg so that for all n{j{k)) > n{i{kg)) we have 
P{{w„(j(k),v„^^j^k)),r]n(j(k))) < 5. For anything in , by continuity of we have 

Af^^''^'' =(/)(w„Q-(i)),V„(y(^)),T]„Q-(i))) ^p(/>(O,O,vT0(')) =A/. 

These two facts imply that A"'^''^'"' — s-p Xl- We have shown that every subsequence Xf"^^ has 



a further subsequence A"^^^'^''' — >-p Aj'', and therefore Af — )-p A,'' by Remark 



19 



3. In this case, the only zero of ( 113 ) with 7 = is found at Ai, and the derivative of the opti- 
mality conditions is nonzero at this estimate, by the computations already given. The result 
follows by the implicit function theorem and subsequence argument, just as in the previous 
case. 



4. Rewriting the derivative ( 1 13 1 



1 ^ X-Vk-r\l + Wk 

we observe that for any positive lambda, the probabiUty that the derivative is positive tends to 
one. Therefore the minimizer Xj converges to in probability, regardless of the value of 7. 
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